Lesson 2: Intro to Data Visualization

Lesson Overview

  • Python's visualization landscape
  • Simple plots with pandas
  • Statistical plots with seaborn
    • Relational, categorical, and distribution plots
    • Semantic mappings
    • Facets
    • Statistical transformations
  • Interactive plots with plotly

For additional resources, check out the following:

Note: The seaborn library released a major update (version 0.11) in September 2020, and the tutorials listed above are for this latest version. If you're using the previous version of seaborn (version 0.10), which is what is installed on Syzygy, some of the syntax shown in the tutorials won't work for you. If you'd like to work through these tutorials, you'll need to install the latest version of seaborn locally on your own computer:

  • If you haven't already, follow the Anaconda installation instructions on our software setup page.
  • Check which version of seaborn is installed by running the following command in the console:
    conda list seaborn
  • If the version number starts with 0.11, you're all set!
  • If not, then upgrade to version 0.11 by running the following command in the console:
    conda install -c conda-forge seaborn=0.11
  • If this works, then you're all set! If you get an error, then a version 0.11 conda build of seaborn might not yet be available for your operating system. In that case, you can try the following: uninstall seaborn with conda remove and then install it with pip inside your conda environment.
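
You can also run the same check from inside Python, which is handy in a notebook. This is a minimal sketch that assumes seaborn can be imported in the active environment; the names version and has_new_api are just illustrative.

```python
import seaborn

# Version string of the seaborn package that this kernel actually imports
version = seaborn.__version__
print(version)

# Compare the numeric major/minor parts against 0.11
major, minor = (int(part) for part in version.split('.')[:2])
has_new_api = (major, minor) >= (0, 11)
print('0.11 or newer:', has_new_api)
```

If the notebook reports a different version than conda list does, the notebook kernel is probably running in a different environment from the console.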

Feel free to reach out to me on Slack or email for help with any of the above steps!

Setup

Pick up where we left off in the previous lesson:

In [1]:
import pandas

world = pandas.read_csv('https://raw.githubusercontent.com/jenfly/datajam-python/master/data/gapminder.csv')
world['pop_millions'] = world['population'] / 1e6
world_2015 = world[world['year'] == 2015]

Data Visualization Libraries

[Figure: Python data visualization libraries]

Plus many, many, many more!

The Broader Landscape

[Figure: the Python visualization landscape]

Image credit: Jake Vanderplas

matplotlib & seaborn

  • matplotlib is a robust, detail-oriented, low-level plotting library.
  • seaborn provides high-level functions on top of matplotlib.
    • Create attractive figures with customized themes.
    • Statistical data visualization: the syntax focuses on expressing what is being explored in the underlying data rather than what graphical elements to add to the plot.

[Figure: matplotlib and seaborn]

plotly

  • Interactive plots and dashboards

[Figure: plotly]

Simple Plots with Pandas

If you want to quickly generate a simple plot, you can use the DataFrame's plot() method to generate a matplotlib-based plot with useful defaults and labels.

Let's use this method to create a bar chart of the total population in each world region.

In [2]:
region_pop = world_2015.groupby('region', as_index=False)['pop_millions'].sum()
region_pop
Out[2]:
region pop_millions
0 Africa 1191.9177
1 Americas 982.6889
2 Asia 4391.6350
3 Europe 740.4830
4 Oceania 38.4860
In [3]:
region_pop.plot(x='region', y='pop_millions', kind='bar')
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f35905e7640>

The plot() method returns a matplotlib Axes object (matplotlib.axes.Axes), which is displayed as cell output. To suppress this output, add a semicolon to the end of the command.

In [4]:
region_pop.plot(x='region', y='pop_millions', kind='bar');

We can create different kinds of plots using the kind keyword argument, such as scatter and line plots, histograms, and others.
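
As a quick illustration of another value for kind, here is a sketch of a histogram. It uses a small made-up DataFrame called demo (hypothetical values, not the Gapminder data) so that it runs on its own; with the lesson data you would call plot() on world_2015['life_expectancy'] instead.

```python
import pandas

# Hypothetical toy data standing in for world_2015 (values are made up)
demo = pandas.DataFrame({
    'life_expectancy': [52.1, 61.4, 68.0, 71.3, 74.9, 78.2, 80.5, 82.7],
})

# kind='hist' bins a single numeric column and draws one bar per bin
ax = demo['life_expectancy'].plot(kind='hist', bins=4, title='Toy histogram');
```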

Let's use the world_2015 DataFrame to create a scatter plot of life expectancy vs. GDP per capita.

  • Each point on the plot represents a single country in 2015
In [5]:
world_2015.plot(x='gdp_per_capita', y='life_expectancy', kind='scatter');
  • Other keyword arguments can be used to customize the figures generated by pandas
    • e.g. Figure size, title, axes properties, colours, etc.
    • To learn more, you can view the documentation in Jupyter from any DataFrame (e.g. world_2015.plot?) or check out this tutorial
  • pandas plots can be further customized using matplotlib functions and methods
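
The two kinds of customization can be sketched together as follows. The demo table and its numbers are made up for illustration (it is shaped like region_pop), and the particular figure size, title, and colour are arbitrary choices.

```python
import pandas

# Hypothetical toy table shaped like region_pop (numbers are made up)
demo = pandas.DataFrame({
    'region': ['A', 'B', 'C'],
    'pop_millions': [120.0, 45.5, 300.2],
})

# Keyword arguments handled by pandas: figure size, title, legend, bar colour
ax = demo.plot(x='region', y='pop_millions', kind='bar', figsize=(6, 3),
               title='Population by Region', legend=False, color='tab:blue')

# plot() returns a matplotlib Axes, so matplotlib methods apply from here on
ax.set_xlabel('')
ax.set_ylabel('Population (millions)');
```

Keeping the returned Axes in a variable (ax here) is what makes the follow-up matplotlib calls possible.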

Statistical Plots with Seaborn

  • While pandas plots are convenient for simple visualizations, they are pretty limited (unless you want to customize them with a lot of additional matplotlib code)
  • The seaborn library also builds on matplotlib and integrates with pandas data structures, but is much more powerful
    • Specifically designed for statistical visualizations
    • Generate a wide variety of beautiful, customized plots with very little code

Types of Plots

Most seaborn plots fall into one of three main categories:

  • Relational
  • Distributions
  • Categorical

Relational Plots

  • Relational plots visualize relationships between two numeric variables
  • There are two types of relational plots in seaborn: scatter plots and line plots

[Figure: examples of relational plots]

Distribution Plots

  • Distribution plots visualize how one or more variables are distributed
  • There are many types of distribution plots in seaborn — a few examples are shown below

[Figure: examples of distribution plots]

Categorical Plots

  • Categorical plots visualize relationships between two variables where one of the variables (x- or y-axis) is categorical (divided into discrete groups)
  • There are many types of categorical plots in seaborn — a few examples are shown below

[Figure: examples of categorical plots]

Getting Started

Let's import the seaborn library and give it the commonly used nickname sns:

In [6]:
import seaborn as sns

Switch to seaborn default aesthetics:

In [7]:
sns.set()
  • This will affect all matplotlib-based plots that are created after you run this command, including those generated by pandas
  • In the latest version of seaborn (0.11) this function has been renamed to set_theme()
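
If you share a notebook between Syzygy (seaborn 0.10) and a local install (seaborn 0.11), one possible way to handle the rename is to look up whichever function is available. This is only a sketch; apply_theme is a name made up for this example.

```python
import seaborn as sns

# set_theme() exists from seaborn 0.11 onward; older versions only have set()
apply_theme = getattr(sns, 'set_theme', sns.set)
apply_theme()
```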

Let's re-create our scatter plot from earlier using seaborn's relplot() function for relational plots

  • relplot() creates either a scatter plot or a line plot (default is scatter)
In [8]:
sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy');

Semantic Mapping

We can easily enrich this plot with additional information from our data by mapping other variables to visual properties such as colour and size

Let's colour each point by region:

In [9]:
sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', hue='region');
  • If the above code caused an error, this is likely due to a newer version of matplotlib installed on your computer which is causing a conflict with relplot()
  • For an inelegant but easy and quick fix, replace the hue='region' keyword argument as shown below:
sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', 
            hue=world_2015['region'].values);
  • For a proper fix that will allow the original hue='region' keyword argument to work:
    • Open a console window (on Windows: Anaconda prompt, on Mac: Terminal) and run the following command
      conda install matplotlib=3.2
    • After the installation is finished, restart the kernel in your Jupyter notebook and run all cells

Exercise 2.1

a) Use relplot() to create another scatter plot of life_expectancy vs. gdp_per_capita from world_2015, in which the points are coloured by income_group instead of region.

b) Add the keyword argument aspect=1.5 to the relplot() function call. How does the plot change?

Customize Axes

  • Since the relationship between life_expectancy and gdp_per_capita appears to be log-linear, let's set the x-axis to log scale
  • We'll also add a title
In [10]:
g = sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', hue='region')
g.set(xscale='log', title='Life Expectancy vs. GDP per Capita in 2015');
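
The object returned by relplot() has other methods for customizing axes as well. As a sketch, set_axis_labels() replaces the raw column names with friendlier labels. The demo data below is made up so that the example runs on its own, and the label text is just one possible choice.

```python
import pandas
import seaborn as sns

# Hypothetical toy data (values are made up for illustration)
demo = pandas.DataFrame({
    'gdp_per_capita': [800, 2500, 9000, 30000, 55000],
    'life_expectancy': [58.0, 65.5, 72.1, 79.4, 82.0],
})

g = sns.relplot(data=demo, x='gdp_per_capita', y='life_expectancy')
g.set(xscale='log')

# Replace the raw column names with reader-friendly axis labels
g.set_axis_labels('GDP per capita (log scale)', 'Life expectancy');
```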

Add Another Semantic Mapping

We can customize our scatter plot to be a "bubble plot", where the size of each marker is proportional to one of the variables in the data

Let's make the markers proportional to population size:

  • Keyword argument size='pop_millions' tells relplot() which variable to use
  • Keyword argument sizes=(40, 400) customizes the range of marker sizes to use
  • Keyword argument alpha=0.8 adds some transparency so it's easier to see overlapping markers
In [11]:
g = sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', hue='region',
                size='pop_millions', sizes=(40, 400), alpha=0.8)
g.set(xscale='log', title='Life Expectancy vs. GDP per Capita in 2015');

We've visualized four variables (gdp_per_capita, life_expectancy, region, and pop_millions) in this single two-dimensional plot!

Facets

  • In the previous examples, we saw how the categorical variable region could be visualized by mapping it to colours
  • Another way to visualize this information is to split the plot into multiple subplots, arranged in a grid
    • These are known as facets

Let's start with a simpler version of our plot:

In [12]:
g = sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', hue='region')
g.set(xscale='log');

Instead of mapping region to colours, let's now map it to facets using the col='region' keyword argument:

In [13]:
g = sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', col='region')
g.set(xscale='log');
  • The subplots are very tiny because they're all squished into one row.
  • We can wrap them into a couple of rows and also customize the facet height so they're easier to read
In [14]:
g = sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', col='region',
                col_wrap=3, height=3)
g.set(xscale='log');

We can also visualize the income groups by mapping them to colours:

In [15]:
g = sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', col='region',
                col_wrap=3, height=3, hue='income_group')
g.set(xscale='log');

We can use the hue_order keyword argument to make sure the income groups are ordered properly:

In [16]:
income_order = ['Low', 'Lower middle', 'Upper middle', 'High']
g = sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', col='region',
                col_wrap=3, height=3, hue='income_group', hue_order=income_order)
g.set(xscale='log');

Instead of manually specifying hue_order each time, we could instead convert the income_group column to a Categorical data type (and similarly for the other categorical variables: country, region, and sub-region). This would ensure the categories are automatically plotted in the correct order.
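
A minimal sketch of that conversion is shown below. It uses a small made-up column (demo) rather than the real data; with the lesson data, the same pandas.Categorical call would be applied to world['income_group'].

```python
import pandas

income_order = ['Low', 'Lower middle', 'Upper middle', 'High']

# Hypothetical toy column standing in for world['income_group']
demo = pandas.DataFrame({
    'income_group': ['High', 'Low', 'Upper middle', 'Lower middle', 'Low'],
})

# Convert the text column to an ordered Categorical with an explicit order
demo['income_group'] = pandas.Categorical(demo['income_group'],
                                          categories=income_order, ordered=True)

# The category order is now stored with the column, and sorting follows it
print(demo['income_group'].cat.categories.tolist())
print(demo.sort_values('income_group')['income_group'].tolist())
```

Because the order lives in the column itself, it also carries over to subsets such as a single-year slice.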

Statistical Transformations

  • So far we've visualized raw values in our DataFrame
  • With seaborn, we can also create plots which perform statistical transformations behind the scenes, calculating new values to plot

Returning to our world DataFrame, which contains data for all years, recall that we can use grouping and aggregation to compute the total world population in each year:

In [17]:
world.groupby('year', as_index=False)['pop_millions'].sum()
Out[17]:
year pop_millions
0 1950 2521.5914
1 1955 2755.4391
2 1960 3014.5238
3 1965 3317.6620
4 1970 3676.8109
5 1975 4052.1130
6 1980 4428.6840
7 1985 4841.1945
8 1990 5294.2122
9 1995 5714.3521
10 2000 6101.9393
11 2005 6495.9793
12 2010 6918.4071
13 2015 7345.2106
  • We can visualize this result directly from the raw data in world using the relplot() function
  • We'll create a line plot, which is well suited to showing how a variable changes over time
  • The estimator='sum' keyword argument tells relplot() how to aggregate the data behind the scenes
  • The ci=None keyword argument tells relplot() to omit the 95% confidence interval which is included by default
In [18]:
sns.relplot(data=world, x='year', y='pop_millions', kind='line', 
            estimator='sum', ci=None);

Now let's see how the population has grown in each income group over time

  • We can map the income_group variable to the colour and also to the line style, to make it easier to distinguish the lines
In [19]:
sns.relplot(data=world, x='year', y='pop_millions', hue='income_group', hue_order=income_order,
            style='income_group', kind='line', estimator='sum', ci=None);

And we can use facets to see the population growth of each income group within each region:

In [20]:
sns.relplot(data=world, x='year', y='pop_millions', hue='income_group', hue_order=income_order,
            style='income_group', kind='line', estimator='sum', ci=None, col='region',
            col_wrap=3, height=3);

Exercise 2.2

Use relplot() to create a plot similar to the previous example, but plotting life_expectancy on the y-axis instead of pop_millions and aggregating with the mean instead of the sum.

  • We want to aggregate with the mean instead of the sum, so you'll need to use the keyword argument estimator='mean'.
  • Other aspects of the plot are the same as the previous example: use the world DataFrame, year on the x-axis, income_group mapped to line colour and style, and faceting on region.

Bonus: Do you spot anything strange in the subplot for the "Americas" region? How could you investigate this using the techniques we learned in the Intro to Pandas lesson?

Bonus: Figure-Level vs. Axes-Level Functions

  • The relplot() function is a "figure-level" function which creates a figure and one or more axes for the facets (if any)
  • To draw the actual scatter plots and line plots on each set of axes, it calls an "axes-level" function:
    • sns.scatterplot() for scatter plots
    • sns.lineplot() for line plots

[Figure: figure-level and axes-level functions]

  • The estimator and ci keyword arguments in our previous example are specific to sns.lineplot(), so they don't appear when you look at the documentation for sns.relplot()
  • You can look at the documentation for sns.lineplot() and sns.scatterplot() for details specific to these functions, and similarly for other axes-level functions

To learn more about figure-level and axes-level functions, check out this tutorial

Bonus: Long vs. Wide Data

Most seaborn plotting functions are designed for data tables that are in long-form, rather than wide-form

[Figure: long-form vs. wide-form data]

In a long-form data table:

  • Each variable is a column
  • Each observation is a row

Our world data is in long-form. Let's take a subset with just the country, year, and population variables. This table contains fewer variables but is still in long-form.

In [21]:
pop_long = world[['country', 'year', 'population']]
pop_long.head()
Out[21]:
country year population
0 Afghanistan 1950 7750000
1 Afghanistan 1955 8270000
2 Afghanistan 1960 9000000
3 Afghanistan 1965 9940000
4 Afghanistan 1970 11100000

In a wide-form data table, the columns and rows contain levels of different variables. We can reorganize pop_long in a couple of different ways to create a wide-form table, for example:

In [22]:
pop_wide = pop_long.pivot(index='year', columns='country', values='population')
pop_wide.head()
Out[22]:
country Afghanistan Albania Algeria Angola Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan ... United Kingdom United States Uruguay Uzbekistan Vanuatu Venezuela Vietnam Yemen Zambia Zimbabwe
year
1950 7750000 1260000 8870000 4550000 46300 17200000 1350000 8180000 6940000 2930000 ... 50600000 159000000 2240000 6260000 47700 5480000 24800000 4400000 2310000 2750000
1955 8270000 1420000 9830000 5120000 52900 18900000 1560000 9210000 6950000 3330000 ... 51100000 172000000 2370000 7300000 54900 6760000 28100000 4770000 2630000 3200000
1960 9000000 1640000 11100000 5640000 55300 20600000 1870000 10300000 7070000 3900000 ... 52400000 187000000 2540000 8550000 63700 8150000 32700000 5170000 3040000 3750000
1965 9940000 1900000 12600000 6200000 60800 22300000 2210000 11400000 7310000 4590000 ... 54300000 200000000 2690000 10100000 74300 9820000 37900000 5640000 3560000 4410000
1970 11100000 2150000 14600000 6780000 67100 24000000 2530000 12800000 7520000 5180000 ... 55600000 210000000 2810000 12100000 85400 11600000 43400000 6190000 4170000 5180000

5 rows × 178 columns

pop_wide contains the same data as pop_long, but the variables do not correspond to the columns, and each row contains multiple observations.
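
If you are handed a wide-form table and want to plot it with seaborn, pandas can reshape it back to long-form with melt(). The sketch below uses a tiny made-up table (demo_long, demo_wide) so the round trip is easy to follow; the same calls should apply to pop_wide.

```python
import pandas

# Hypothetical toy long-form table (population values are made up)
demo_long = pandas.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'year': [2010, 2015, 2010, 2015],
    'population': [100, 110, 200, 230],
})

# Long -> wide: one row per year, one column per country
demo_wide = demo_long.pivot(index='year', columns='country', values='population')

# Wide -> long: melt() stacks the country columns back into rows
round_trip = demo_wide.reset_index().melt(id_vars='year', var_name='country',
                                          value_name='population')
print(round_trip)
```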

To learn more about long-form vs. wide-form data, check out this tutorial.

Categorical Plot

  • As a quick example of a categorical plot, let's use the catplot() function to create a bar plot of mean life expectancy in each region in 2015
  • Similar to the line plots in our previous examples, catplot() will group and aggregate the data behind the scenes
  • When we omit the estimator keyword argument, it defaults to aggregating with the mean
In [23]:
g = sns.catplot(data=world_2015, x='region', y='life_expectancy', kind='bar', aspect=1.5)
g.set(title='Mean Life Expectancy by Region in 2015');
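
To see which numbers the bars represent, we can compute the same aggregation ourselves with pandas. This sketch uses a made-up demo table; with the lesson data you would group world_2015 by region in the same way.

```python
import pandas

# Hypothetical toy data standing in for world_2015 (values are made up)
demo = pandas.DataFrame({
    'region': ['A', 'A', 'B', 'B', 'B'],
    'life_expectancy': [60.0, 70.0, 75.0, 80.0, 85.0],
})

# The bar heights that kind='bar' would draw: one mean per category
region_means = demo.groupby('region', as_index=False)['life_expectancy'].mean()
print(region_means)
```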

Interactive Plots with Plotly

  • As a very brief intro to plotly, we will look at a few examples with the Plotly Express library
  • Plotly Express is a high-level interface to create plotly-based plots, similar to using seaborn as a high-level interface to matplotlib-based plots
  • Plotly Express syntax is slightly different from seaborn, but it uses the same concepts such as semantic mapping and facets

First we'll import Plotly Express and give it the commonly used nickname px:

In [24]:
import plotly.express as px

Let's recreate one of our previous scatter plots:

In [25]:
px.scatter(data_frame=world_2015, x='gdp_per_capita', y='life_expectancy', color='region',
           size='pop_millions', size_max=30, log_x=True, hover_data=['country'],
           title='Life Expectancy vs. GDP per Capita in 2015')
  • One big difference in this plot compared to the seaborn version is that you can hover over any point to see the data values (country, region, gdp_per_capita, life_expectancy, pop_millions)
  • We can also use the toolbar at the top right of the figure to zoom in and out, and pan around

We can create a facet plot as well:

In [26]:
px.scatter(data_frame=world_2015, x='gdp_per_capita', y='life_expectancy', 
           facet_col='region', facet_col_wrap=3,
           color='income_group', category_orders={'income_group' : income_order},
           log_x=True, hover_data=['country'],
           title='Life Expectancy vs. GDP per Capita in 2015')

With Plotly Express, we can easily add another variable to our plot as an animation frame

  • We can play and pause the animation, and also click on a specific year to see the plot for that year
In [27]:
px.scatter(data_frame=world, x='gdp_per_capita', y='life_expectancy', color='region',
           size='pop_millions', size_max=30, log_x=True, range_y=(20, 90),
           title='Life Expectancy vs. GDP per Capita (1950-2015)',
           animation_frame='year')
  • With Plotly Express, you can easily create interactive versions of many (but not all) of the standard seaborn plots
  • For more advanced statistical plots, seaborn is usually a better option

To learn more about Plotly Express, check out this tutorial.

Thank You!